Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(deps): update dependency sentence_transformers to v3 #1291

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

renovate[bot]
Copy link
Contributor

@renovate renovate bot commented Jun 11, 2024

This PR contains the following updates:

Package Change Age Adoption Passing Confidence
sentence_transformers ==2.7.0 -> ==3.2.1 age adoption passing confidence

Release Notes

UKPLab/sentence-transformers (sentence_transformers)

v3.2.1: - Patch CLIP loading, small ONNX fix, compatibility with other libraries

Compare Source

This patch release fixes some small bugs, such as related to loading CLIP models, automatic model card generation issues, and ensuring compatibility with third party libraries.

Install this version with

### Training + Inference
pip install sentence-transformers[train]==3.2.1

### Inference only, use one of:
pip install sentence-transformers==3.2.1
pip install sentence-transformers[onnx-gpu]==3.2.1
pip install sentence-transformers[onnx]==3.2.1
pip install sentence-transformers[openvino]==3.2.1

Fixing Loading non-Transformer models

In v3.2.0, a non-Transformer based model (e.g. CLIP) would not load correctly if the model was saved in the root of the model repository/directory. This has been resolved in #​3007.

Throw error if StaticEmbedding-based model is finetuned with incompatible losses

The following losses are not compatible with StaticEmbedding-based models:

  • CachedGISTEmbedLoss
  • CachedMultipleNegativesRankingLoss
  • CachedMultipleNegativesSymmetricRankingLoss
  • DenoisingAutoEncoderLoss
  • GISTEmbedLoss

An error is now thrown when one of these are used with a StaticEmbedding-based model. I recommend using MultipleNegativesRankingLoss to finetune these models, e.g. as in https://huggingface.co/tomaarsen/static-bert-uncased-gooaq.
Note: to get good performance, you must use much higher learning rates than otherwise. In my experiments, 2e-1 worked well.

Patch ONNX model when the model uses output_hidden_states

For example, this script used to fail, but passes now:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "distiluse-base-multilingual-cased",
    backend="onnx",
    model_kwargs={"provider": "CPUExecutionProvider"},
)

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape)

All changes

New Contributors

Full Changelog: UKPLab/sentence-transformers@v3.2.0...v3.2.1

v3.2.0: - ONNX and OpenVINO backends offering 2-3x speedup; Static Embeddings offering 50x-500x speedups at ~10-20% performance cost

Compare Source

This release introduces 2 new efficient computing backends for SentenceTransformer models: ONNX and OpenVINO + optimization & quantization, allowing for speedups up to 2x-3x; static embeddings via Model2Vec allowing for lightning-fast models (i.e., 50x-500x speedups) at a ~10%-20% performance cost; and various small improvements and fixes.

Install this version with

### Training + Inference
pip install sentence-transformers[train]==3.2.0

### Inference only, use one of:
pip install sentence-transformers==3.2.0
pip install sentence-transformers[onnx-gpu]==3.2.0
pip install sentence-transformers[onnx]==3.2.0
pip install sentence-transformers[openvino]==3.2.0

Faster ONNX and OpenVINO Backends for SentenceTransformer (#​2712)

Introducing a new backend keyword argument to the SentenceTransformer initialization, allowing values of "torch" (default), "onnx", and "openvino".
These come with new installations:

pip install sentence-transformers[onnx-gpu]

### or ONNX for CPU only:
pip install sentence-transformers[onnx]

### or
pip install sentence-transformers[openvino]

It's as simple as:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)

If you specify a backend and your model repository or directory contains an ONNX/OpenVINO model file, it will automatically be used! And if your model repository or directory doesn't have one already, an ONNX/OpenVINO model will be automatically exported. Just remember to model.push_to_hub or model.save_pretrained into the same model repository or directory to avoid having to re-export the model every time.

All keyword arguments passed via model_kwargs will be passed on to ORTModel.from_pretrained or OVBaseModel.from_pretrained. The most useful arguments are:

  • provider: (Only if backend="onnx") ONNX Runtime provider to use for loading the model, e.g. "CPUExecutionProvider" . See https://onnxruntime.ai/docs/execution-providers/ for possible providers. If not specified, the strongest provider (E.g. "CUDAExecutionProvider") will be used.
  • file_name: The name of the ONNX file to load. If not specified, will default to "model.onnx" or otherwise "onnx/model.onnx" for ONNX, and "openvino_model.xml" and "openvino/openvino_model.xml" for OpenVINO. This argument is useful for specifying optimized or quantized models.
  • export: A boolean flag specifying whether the model will be exported. If not provided, export will be set to True if the model repository or directory does not already contain an ONNX or OpenVINO model.

For example:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "all-MiniLM-L6-v2",
	backend="onnx",
	model_kwargs={
		"file_name": "model_O3.onnx",
		"provider": "CPUExecutionProvider",
	}
)

sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
Benchmarks

We ran benchmarks for CPU and GPU, averaging findings across 4 models of various sizes, 3 datasets, and numerous batch sizes. Here are the findings:

These findings resulted in these recommendations:
image

For GPU, you can expect 2x speedup with fp16 at no cost, and for CPU you can expect ~2.5x speedup at a cost of 0.4% accuracy.

ONNX Optimization and Quantization

In addition to exporting default ONNX and OpenVINO models, we also introduce 2 helper methods for optimizing and quantizing ONNX models:

Optimization

export_optimized_onnx_model: This function uses Optimum to implement several optimizations in the ONNX model, ranging from basic optimizations to approximations and mixed precision. Read about the 4 default options here. This function accepts:

  • model A SentenceTransformer model loaded with backend="onnx".
  • optimization_config: "O1", "O2", "O3", or "O4" from 🤗 Optimum or a custom OptimizationConfig instance.
  • model_name_or_path: The directory or model repository where the optimized model will be saved.
  • push_to_hub: Whether the push the exported model to the hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
  • create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for optimizing models of repositories that you don't have write access to.
  • file_suffix: The suffix to add to the optimized model file name. Will use the optimization_config string or "optimized" if not set.

The usage is like this:

from sentence_transformers import SentenceTransformer, export_optimized_onnx_model

onnx_model = SentenceTransformer("BAAI/bge-large-en-v1.5", backend="onnx")
export_optimized_onnx_model(
	model=onnx_model,
	optimization_config="O4",
	model_name_or_path="BAAI/bge-large-en-v1.5",
	push_to_hub=True,
	create_pr=True,
)

After which you can load the model with:

from sentence_transformers import SentenceTransformer

pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = SentenceTransformer(
   "BAAI/bge-large-en-v1.5",
   backend="onnx",
   model_kwargs={"file_name": "onnx/model_O4.onnx"},
   revision=f"refs/pr/{pull_request_nr}"
)

or when it gets merged:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
   "BAAI/bge-large-en-v1.5",
   backend="onnx",
   model_kwargs={"file_name": "onnx/model_O4.onnx"},
)
Quantization

export_dynamic_quantized_onnx_model: This function uses Optimum to quantize the ONNX model to int8, also allowing for hardware-specific optimizations. This results in impressive speedups for CPUs. In my findings, each of the default quantization configuration options gave approximately the same performance improvements. This function accepts

  • model A SentenceTransformer model loaded with backend="onnx".
  • quantization_config: "arm64", "avx2", "avx512", or "avx512_vnni" representing quantization configurations from AutoQuantizationConfig, or an QuantizationConfig instance.
  • model_name_or_path: The directory or model repository where the optimized model will be saved.
  • push_to_hub: Whether the push the exported model to the hub with model_name_or_path as the repository name. If False, the model will be saved in the directory specified with model_name_or_path.
  • create_pr: If push_to_hub, then this denotes whether a pull request is created rather than pushing the model directly to the repository. Very useful for quantizing models of repositories that you don't have write access to.
  • file_suffix: The suffix to add to the optimized model file name. Will use the quantization_config string or e.g. "int8_quantized" if not set.

The usage is like this:

from sentence_transformers import SentenceTransformer, export_quantized_onnx_model

onnx_model = SentenceTransformer("BAAI/bge-large-en-v1.5", backend="onnx")
export_quantized_onnx_model(
	model=onnx_model,
	quantization_config="avx512",
	model_name_or_path="BAAI/bge-large-en-v1.5",
	push_to_hub=True,
	create_pr=True,
)

After which you can load the model with:

from sentence_transformers import SentenceTransformer

pull_request_nr = 2 # TODO: Update this to the number of your pull request
model = SentenceTransformer(
   "BAAI/bge-large-en-v1.5",
   backend="onnx",
   model_kwargs={"file_name": "onnx/model_qint8_avx512.onnx"},
   revision=f"refs/pr/{pull_request_nr}"
)

or when it gets merged:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
   "BAAI/bge-large-en-v1.5",
   backend="onnx",
   model_kwargs={"file_name": "onnx/model_qint8_avx512.onnx"},
)

Lightning-Fast Static Embeddings via Model2Vec (#​2961)

If ONNX or OpenVINO isn't fast enough for you yet, then perhaps you'll enjoy Static Embeddings. These embeddings are a bit akin to GLoVe or Word2vec, i.e. they're bags of token embeddings that are summed together to create text embeddings, allowing for lightning-fast embeddings that don't require any neural networks.

However, these Static Embeddings are created in different ways. For example:

  1. Distillation via the Model2Vec technique. This projects allows you to distill any Sentence Transformer model into Static Embeddings. For example, distilling BAAI/bge-base-en-v1.5 resulted in a Static Embeddings Sentence Transformer model that reaches 87.5% of the performance of all-MiniLM-L6-v2 on MTEB (+ PEARL & WordSim) and 97.4% of the performance of all-MiniLM-L6-v2 on various classification benchmarks.
    You can initialize Static Embeddings via Model2Vec in two ways:

note: pip install model2vec is needed, but not for inference

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

Initialize a Sentence Transformer model with a static embedding from a pretrained model2vec model

static_embedding = StaticEmbedding.from_model2vec("minishlab/M2V_multilingual_output")
model = SentenceTransformer(modules=[static_embedding])

Encode some texts

queries = ["What is the capital of France?", "How many people live in the Netherlands?"]
documents = ["Paris is the capital of France", "The Netherlands has 17 million inhabitants"]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

Compute similarities

scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
"""
tensor([[0.8170, 0.3843],
        [0.3929, 0.5818]])
"""
```
* [`from_distillation`](https://sbert.net/docs/package_reference/sentence_transformer/models.html#sentence_transformers.models.StaticEmbedding.from_distillation): You can use the name of any Sentence Transformer model alongside some parameters (See [this docs](https://redirect.github.com/MinishLab/model2vec#distilling-a-model2vec-model) for more information) to perform the distillation yourself, without needing any dataset. On my device, this takes ~4s on a GPU and ~2 minutes on a CPU:
```python

note: pip install model2vec is needed, but not for inference

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

Initialize a Sentence Transformer model with a static embedding by distilling via model2vec

static_embedding = StaticEmbedding.from_distillation(
    "mixedbread-ai/mxbai-embed-large-v1",
    device="cuda",
    pca_dims=256,
    apply_zipf=True,
)
model = SentenceTransformer(modules=[static_embedding])

Encode some texts

queries = ["What is the capital of France?", "How many people live in the Netherlands?"]
documents = ["Paris is the capital of France", "The Netherlands has 17 million inhabitants"]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

Compute similarities

scores = model.similarity(query_embeddings, document_embeddings)
print(scores)
"""
tensor([[0.8430, 0.3271],
        [0.3213, 0.5861]])
"""
```
  1. Random initialization: Although this initialization needs finetuning, finetuning a Sentence Transformers model backed by StaticEmbedding is extremely fast. For example, I was able to finetune tomaarsen/static-bert-uncased-gooaq with MatryoshkaLoss & MultipleNegativesRankingLoss on the entire (3 million pairs) gooaq dataset in just 7 minutes. This model reaches a NDCG@10 of 79.33 on a hold-out set of 10k samples from gooaq, whereas e.g. BAAI/bge-base-en-v1.5 reaches 85.01 NDCG@10. In short, only 6.6% less performance for a model that's about 500x faster.
    That's not a typo: I can compute embeddings for about 14000 stsb sentences from per second on CPU, compared to about ~24 with BAAI/bge-base-en-v1.5, a.k.a. 625x faster.

[!NOTE]
You can save_pretrained and load these models like any other Sentence Transformer models, the StaticEmbedding initialization is only necessary when you're creating a new model.

  • Creation:
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.models import StaticEmbedding
    
    # Initialize a Sentence Transformer model with a static embedding from a pretrained model2vec model
    static_embedding = StaticEmbedding.from_distillation(
        "mixedbread-ai/mxbai-embed-large-v1",
        device="cuda",
        pca_dims=256,
        apply_zipf=True,
    )
    model = SentenceTransformer(modules=[static_embedding])
    model.save_pretrained("static-mxbai-embed-large-v1")
    # or
    # model.push_to_hub("tomaarsen/static-mxbai-embed-large-v1")
  • Inference:
    from sentence_transformers import SentenceTransformer
    
    # Initialize a Sentence Transformer model with a static embedding
    model = SentenceTransformer("static-mxbai-embed-large-v1")
    
    model.encode([...])

Small changes

  • The InformationRetrievalEvaluator now accepts query_prompt, query_prompt_name, corpus_prompt, and corpus_prompt_name arguments, useful if your model requires specific prompts for queries and/or documents for the best performance. (#​2951)
  • The mine_hard_negatives function now accepts anchor_column_name and positive_column_name for specifying which dataset columns will be used. If not specified, the first two columns are used, respectively. Additionally, the min_score parameter is added, ensuring that all mined negatives have a similarity score of at least min_score according to the chosen SentenceTransformer or CrossEncoder model. (#​2977)
  • If you're using multiple evaluators during training via SequentialEvaluator, e.g. multiple evaluators for different Matryoshka dimensions, then the order is now preserved in the training logs in the model card. Previously, they were sorted by name, resulting in weird orderings (e.g. "gooaq-1024", "gooaq-128", "gooaq-256", "gooaq-32", "gooaq-512", "gooaq-64") (#​2963)
  • CachedGISTEmbedLoss has been improved to support multiple negatives per sample, i.e. the loss now accepts data in the (anchor, positive, negative_1, …, negative_n) format. It is the third loss to support this format (see docs):

image

All changes

New Contributors

Special thanks to @​echarlaix for making the new backends possible due to some last-minute changes in optimum and optimum-intel.

Full Changelog: UKPLab/sentence-transformers@v3.1.1...v3.2.0

v3.1.1: - Patch hard negative mining & remove numpy<2 restriction

Compare Source

This patch release fixes hard negatives mining for models that don't automatically normalize their embeddings and it lifts the numpy<2 restriction that was previously required.

Install this version with

### Full installation:
pip install sentence-transformers[train]==3.1.1

### Inference only:
pip install sentence-transformers==3.1.1

Hard Negatives Mining Patch (#​2944)

The mine_hard_negatives utility introduced in the previous release would fail if use_faiss=True & the model does not automatically normalize its embeddings. This release patches that, allowing the utility to work with all Sentence Transformer models:

from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

### Load a Sentence Transformer model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1").bfloat16()

### Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train[:10000]")
print(dataset)
"""
Dataset({
    features: ['query', 'answer'],
    num_rows: 10000
})
"""

### Mine hard negatives
dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    range_min=10,
    range_max=50,
    max_score=0.8,
    margin=0.1,
    num_negatives=5,
    sampling_strategy="random",
    batch_size=128,
    use_faiss=True,
)
'''
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:21<00:00,  3.51it/s]
Batches: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:03<00:00, 25.77it/s]
Querying FAISS index: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.98it/s]
Metric       Positive       Negative     Difference
Count          10,000         47,711
Mean           0.7600         0.5376         0.2299
Median         0.7673         0.5379         0.2274
Std            0.0658         0.0387         0.0629
Min            0.3858         0.3732         0.1044
25%            0.7219         0.5129         0.1833
50%            0.7673         0.5379         0.2274
75%            0.8058         0.5617         0.2724
Max            0.9341         0.7024         0.4780
Skipped 48770 potential negatives (9.56%) due to the margin of 0.1.
Could not find enough negatives for 2289 samples (4.58%). Consider adjusting the range_max, range_min, margin and max_score parameters if you'd like to find more valid negatives.
'''
print(dataset)
'''
Dataset({
    features: ['query', 'answer', 'negative'],
    num_rows: 47711
})
'''
print(dataset[0])
'''
{
    'query': 'where is the us navy base in japan located',
    'answer': 'United States Fleet Activities Yokosuka The United States Fleet Activities Yokosuka (横須賀海 軍施設, Yokosuka kaigunshisetsu) or Commander Fleet Activities Yokosuka (司令官艦隊活動横須賀, Shirei-kan kantai katsudō Yokosuka) is a United States Navy base in Yokosuka, Japan. Its mission is to maintain and operate base facilities for the logistic, recreational, administrative support and service of the U.S. Naval Forces Japan, Seventh Fleet and other operating forces assigned in the Western Pacific. CFAY is the largest strategically important U.S. naval installation in the western Pacific.[1] As of August 2013[update], it was commanded by Captain David Glenister.',
    'negative': "2011 Tōhoku earthquake and tsunami The earthquake took place at 14:46 JST (UTC 05:46) around 67\xa0km (42\xa0mi) from the nearest point on Japan's coastline, and initial estimates indicated the tsunami would have taken 10 to 30\xa0minutes to reach the areas first affected, and then areas farther north and south based on the geography of the coastline.[127][128] Just over an hour after the earthquake at 15:55 JST, a tsunami was observed flooding Sendai Airport, which is located near the coast of Miyagi Prefecture,[129][130] with waves sweeping away cars and planes and flooding various buildings as they traveled inland.[131][132] The impact of the tsunami in and around Sendai Airport was filmed by an NHK News helicopter, showing a number of vehicles on local roads trying to escape the approaching wave and being engulfed by it.[133] A 4-metre-high (13\xa0ft) tsunami hit Iwate Prefecture.[134] Wakabayashi Ward in Sendai was also particularly hard hit.[135] At least 101 designated tsunami evacuation sites were hit by the wave.[136]"
}
'''
dataset.push_to_hub("natural-questions-hard-negatives", "triplet")

Thanks to @​omarnj-lab for pointing out the bug to me.

Numpy restriction lifted (#​2937)

The v3.1.0 Sentence Transformers release required numpy<2 to prevent crashes on Windows. However, various third-parties (e.g. scipy) have now been recompiled & released, allowing the Windows tests to pass again.

If you experience the following snippet:

A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.0 as it may crash. To support both 1.x and 2.x versions of NumPy, modules must be compiled with NumPy 2.0. Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to downgrade to 'numpy<2' or try to upgrade the affected module. We expect that some modules will need time to support NumPy 2.

Then consider 1) upgrading the dependency from which the error occurred or 2) downgrading numpy to below v2:

pip install -U numpy<2

Thanks to @​kozlek for pointing this out to me and helping getting it resolved.

All changes

Full Changelog: UKPLab/sentence-transformers@v3.1.0...v3.1.1

v3.1.0: - Hard Negatives Mining utility; new loss function for symmetric tasks; streaming datasets; custom modules

Compare Source

This release introduces a hard negatives mining utility to get better models out of your data, a new strong loss function for symmetric tasks, training with streaming datasets to avoid having to store datasets fully on disk, custom modules to allow for more creativity from model authors, and many bug fixes, small additions and documentation improvements.

Install this version with

### Full installation:
pip install sentence-transformers[train]==3.1.0

### Inference only:
pip install sentence-transformers==3.1.0

[!WARNING]
Due to incompatibilities with Windows, we have set numpy<2 in the Sentence Transformers requirements. If you're not on Windows, you can still install numpy>=2 and everything should work as expected.

Hard Negatives Mining utility (#​2768, #​2848)

Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. For example:

  • Anchor: "are red pandas actually pandas?"
  • Positive: "Red pandas, like giant pandas, are bamboo eaters native to Asia's high forests. Despite these similarities and their shared name, the two species are not closely related. Red pandas are much smaller than giant pandas and are the only living member of their taxonomic family."
  • Hard negative: "The giant panda (Ailuropoda melanoleuca; Chinese: 大熊猫; pinyin: dàxióngmāo), also known as the panda bear or simply the panda, is a bear native to south central China."

These negatives are more difficult for a model to distinguish from the correct answer, leading to a stronger training signal and a stronger overall model when used with one of the Loss Functions that accepts (anchor, positive, negative) pairs such as the one above.

This release introduces a utility function called mine_hard_negatives that allows you to mine for these hard negatives given a (anchor, positive) dataset (and optionally a corpus of negative candidate texts).

It boasts the following features to give you fine-grained control over the similarity of the mined negatives relative to the anchor:

  • CrossEncoder rescoring for higher quality negative selection.
  • Skip the top $n$ negative candidates as these might be true positives.
  • Consider only the top $n$ negative candidates.
  • Skip negative candidates that are within some margin of the true similarity between anchor and positive.
  • Skip negative candidates whose similarity is larger than some max_score.
  • Two sampling strategies: pick the top negative candidates that satisfy the requirements, or pick them randomly.
  • FAISS index for searching for negative candidates.
  • Option to return data as triplets only, or as 2 + num_negatives-tuples.
from sentence_transformers.util import mine_hard_negatives
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

### Load a Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

### Load a dataset to mine hard negatives from
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
print(dataset)
"""
Dataset({
    features: ['query', 'answer'],
    num_rows: 100231
})
"""

### Mine hard negatives
dataset = mine_hard_negatives(
    dataset=dataset,
    model=model,
    range_min=10,
    range_max=50,
    max_score=0.8,
    margin=0.1,
    num_negatives=5,
    sampling_strategy="random",
    batch_size=128,
    use_faiss=True,
)
'''
Batches: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 588/588 [00:33<00:00, 17.37it/s]
Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 784/784 [00:07<00:00, 101.55it/s]
Querying FAISS index: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:07<00:00,  1.06s/it]
Metric       Positive       Negative     Difference
Count         100,231        460,725        460,725
Mean           0.6866         0.4133         0.2917
Median         0.7010         0.4059         0.2873
Std            0.1125         0.0673         0.1006
Min            0.0303         0.1638         0.1029
25%            0.6221         0.3649         0.2112
50%            0.7010         0.4059         0.2873
75%            0.7667         0.4561         0.3647
Max            0.9584         0.7362         0.7073
Skipped 882722 potential negatives (17.27%) due to the margin of 0.1.
Skipped 27 potential negatives (0.00%) due to the maximum score of 0.8.
Could not find enough negatives for 40430 samples (8.07%). Consider adjusting the range_max, range_min, margin and max_score parameters if you'd like to find more valid negatives.
'''
print(dataset)
'''
Dataset({
    features: ['query', 'answer', 'negative'],
    num_rows: 460725
})
'''
print(dataset[0])
'''
{
    'query': 'the first person to use the word geography was',
    'answer': 'History of geography The history of geography includes many histories of geography which have differed over time and between different cultural and political groups. In more recent developments, geography has become a distinct academic discipline. \'Geography\' derives from the Greek γεωγραφία – geographia,[1] a literal translation of which would be "to describe or write about the Earth". The first person to use the word "geography" was Eratosthenes (276–194 BC). However, there is evidence for recognizable practices of geography, such as cartography (or map-making) prior to the use of the term geography.',
    'negative': 'Terminology of the British Isles The word "Great" means "larger", in comparison with Brittany in modern-day France. One historical term for the peninsula in France that largely corresponds to the modern French province is Lesser or Little Britain. That region was settled by many British immigrants during the period of Anglo-Saxon migration into Britain, and named "Little Britain" by them. The French term "Bretagne" now refers to the French "Little Britain", not to the British "Great Britain", which in French is called Grande-Bretagne. In classical times, the Graeco-Roman geographer Ptolemy in his Almagest also called the larger island megale Brettania (great Britain). At that time, it was in contrast to the smaller island of Ireland, which he called mikra Brettania (little Britain).[62] In his later work Geography, Ptolemy refers to Great Britain as Albion and to Ireland as Iwernia. These "new" names were likely to have been the native names for the islands at the time. The earlier names, in contrast, were likely to have been coined before direct contact with local peoples was made.[63]'
}
'''
dataset.push_to_hub("natural-questions-hard-negatives", "triplet")

This dataset can immediately be used in conjunction with MultipleNegativesRankingLoss, likely resulting in a stronger model than if you had just used the natural-questions dataset outright.

Here are some example datasets that I created using this new function:

Big thanks to @​ChrisGeishauser and @​ArthurCamara for assisting with this feature.

Add CachedMultipleNegativesSymmetricRankingLoss loss function (#​2879)

Let's break this down:

  • MultipleNegativesRankingLoss (MNRL): Given (anchor, positive) text pairs or (anchor, positive, negative) text triplets, this loss trains for "Given an anchor (e.g. a query), which text out of a big lineup (all positives and negatives in the batch) is the true positive (e.g. the answer)?".
  • MultipleNegativesSymmetricRankingLoss (MNSRL): Adaptation of MNRL that adds a second loss term which means: "Given an positive (e.g. an summary), which text out of a big lineup (all anchors) is the true anchor (e.g. the full article)?". This is useful for symmetric tasks, such as clustering, classification, finding similar texts, and a bit less useful for asymmetric tasks such as question-answer retrieval.
  • CachedMultipleNegativesRankingLoss (CMNRL): Adaptation of MNRL such that the batch size can be increased to an arbitrary size at a flat 10-20% training speed cost. A higher batch size means a larger lineup for the model to find the true positive in, often resulting in a better training signal and model.

The v3.1 Sentence Transformers release now introduces a new loss: CachedMultipleNegativesSymmetricRankingLoss (CMNSRL), which combines both of the previous adaptations. The result is a loss adept at symmetric training tasks for which you can pick an arbitrarily large batch size. It is likely the strongest loss for Semantic Textual Similarity (STS) tasks in Sentence Transformers now.
Big thanks to @​madhavthaker1 for working to include it.

Streaming Dataset support (#​2792)

The v3.1 release introduces support for training with datasets.IterableDataset (Differences between Dataset and IterableDataset docs). This means that you can train without first downloading the full dataset to disk. For example:

from datasets import load_dataset

### Load a streaming dataset to finetune on
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)

### IterableDataset({
###     features: ['question', 'answer'],

###     n_shards: 2
### })

or

from datasets import IterableDataset, Value, Features

def dataset_generator_fn():

### Gather, fetch, load, or generate data here
    for ... in ...:
        yield ...

train_dataset = IterableDataset.from_generator(dataset_generator_fn)
train_dataset = train_dataset.cast(Features({'question': Value(dtype='string', id=None), 'answer': Value(dtype='string', id=None)}))

(Read more about Dataset features here)

For a full example of training with a streaming dataset, consider this script:

import logging
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO
)

### 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer(
    "microsoft/mpnet-base",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="MPNet base trained on GooAQ pairs",
    ),
)

name = "mpnet-base-gooaq-streaming"

### 2. Load a streaming dataset to finetune on
train_dataset = load_dataset("sentence-transformers/gooaq", split="train", streaming=True)

### 3. Define a loss function
loss = MultipleNegativesRankingLoss(model)

### 4. (Optional) Specify training arguments
train_batch_size = 64
args = SentenceTransformerTrainingArguments(

### Required parameter:
    output_dir=f"models/{name}",

### Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=train_batch_size,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch

### Optional tracking/debugging parameters:
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    logging_steps=250,
    logging_first_step=True,
    run_name=name,  # Will be used in W&B if `wandb` is installed
)

### 5. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

### 6. Save the trained model
model.save_pretrained(f"models/{name}/final")

### 7. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(name)

Advanced: Allow for Custom Modules (#​2773)

Sentence Transformer models consist of several modules that are executed sequentially. Most models consist of a Transformer module, a Pooling module, and perhaps a Dense and/or Normalize module. However, as of the v3.1 release, model authors can create their own modules by writing some custom modeling code. This code can be uploaded to the Hugging Face Hub alongside the model itself, after which users can load the model like normal.

This allows for authors to replace the Transformer module with one that includes model-specific quirks, or replace the Pooling module with an all-new pooling method. This even allows for multi-modal models as authors can customize the preprocessing of the first module.

jinaai/jina-clip-v1 is the first model to take advantage of this new feature, allowing you to encode both texts and images (via paths to local images or URLs) due to their custom preprocessing. Try it out yourself:

from sentence_transformers import SentenceTransformer

### Load the model; must use trust_remote_code=True to run the custom module
model = SentenceTransformer("jinaai/jina-clip-v1", trust_remote_code=True)

### Texts and images of blue and red cats to embed
sentences = ['A blue cat', 'A red cat']
image_urls = [
    'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
    'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]

### Embed the texts and images like normal
text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)

### Compute similarity between text embeddings:
print(model.similarity(text_embeddings[0], text_embeddings[1]))

### tensor([[✅0.5636]])
### or cross-modal text and image embeddings:
print(model.similarity(text_embeddings, image_embeddings))

### tensor([[✅0.2906, ❌0.0569],
###         [❌0.1277, ✅0.2916]]

Additionally, model authors can take advantage of keyword argument passthrough. By updating the modules.json file to include a list of kwargs, e.g.:

[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "custom_transformer.CustomTransformer",
    "kwargs": ["task_type"]
  },
  ...
]

then if a user provides the task_type keyword argument in model.encode, this value will be propagated to the forward of the custom module(s). This way, users can specify some custom functionality on the fly during inference time (as well as during load time via the model_kwargs option when initializing a SentenceTransformer model).

Update dependency versions (#​2757)

  • Restrict numpy<2.0.0 due to issues with torch and numpy interoperability on Windows.
  • Increment minimum transformers version to 4.38.0 & huggingface-hub to 0.19.3 to prevent a training crash related to the prefetch_factor option

Smaller Highlights

Features
Bug fixes
  • Prevent crash when encoding an empty list (#​2759)
  • Support training with GISTEmbedLoss with DataParallel (DP) and DataDistributedParallel (DDP) (#​2772)
  • Fix a bug in GroupByLabelBatchSampler resulting in some data not being used in training (#​2788)
  • Prevent crash if a datasets directory exists locally (#​2859)
  • Fix Matryoshka2dLoss not importing correctly (#​2907)
  • Resolve niche training bug with training if using multi-dataset, no-duplicates, and dataloader_drop_last=True (#​2877)
  • Fix torch_compile=True not working in the SentenceTransformersTrainingArguments: should now work for faster training (#​2884)
  • Fix SoftmaxLoss performing worse since v3.0 as a Linear layer was ignored by the optimizer (#​2881)
  • Fix trainer.train(resume_from_checkpoint="...") with custom models (i.e. trust_remote_code) (#​2918)
  • Fix the evaluation using the training batch size (#​2847)
  • Fix encoding when passing model_kwargs={"torch_dtype": torch.float16} with models that use Dense layers (#​2889)
Documentation

All changes


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate renovate bot force-pushed the renovate/sentence_transformers-3.x branch from 07368a3 to 3c92933 Compare June 16, 2024 11:35
@renovate renovate bot force-pushed the renovate/sentence_transformers-3.x branch from 3c92933 to 7868afe Compare July 4, 2024 15:46
Copy link

sonarcloud bot commented Jul 4, 2024

@renovate renovate bot force-pushed the renovate/sentence_transformers-3.x branch from 7868afe to 7109494 Compare July 10, 2024 15:58
@renovate renovate bot force-pushed the renovate/sentence_transformers-3.x branch 2 times, most recently from fbfb287 to eaa847e Compare September 20, 2024 09:44
@renovate renovate bot force-pushed the renovate/sentence_transformers-3.x branch from eaa847e to 7e1a61c Compare October 10, 2024 18:37
@renovate renovate bot force-pushed the renovate/sentence_transformers-3.x branch from 7e1a61c to 834f49c Compare October 21, 2024 15:05
@renovate renovate bot force-pushed the renovate/sentence_transformers-3.x branch from 834f49c to cbe02a5 Compare October 22, 2024 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0 participants